Shrinkage
Cornell College
STA 362 Spring 2024 Block 8
Ridge regression and Lasso - The subset selection methods use least squares to fit a linear model that contains a subset of the predictors.
As an alternative, we can fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero.
It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance. In particular, the ridge regression coefficient estimates \(\hat{\beta}_\lambda^R\) are the values that minimize
\[\sum_{i=1}^n(y_i-\beta_0-\sum_{j=1}^p\beta_jx_{ij})^2+\lambda\sum_{j=1}^p\beta_j^2\]
\[ = RSS + \lambda\sum_{j=1}^p\beta_j^2\]
where \(\lambda\geq 0\) is a tuning parameter, to be determined separately.
Like least squares, ridge regression seeks coefficient estimates that fit the data well by making the RSS small.
The second term, \(\lambda\sum_j\beta_j^2\), called a shrinkage penalty, is small when \(\beta_1,\dots,\beta_p\) are close to 0, and so it has the effect of shrinking the estimates of \(\beta_j\) toward 0.
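The effect of the penalty can be seen directly from the closed-form ridge solution \(\hat{\beta}_\lambda^R = (X^TX+\lambda I)^{-1}X^Ty\) (with centered data so the unpenalized intercept drops out). A minimal numpy sketch on simulated data; the data and variable names here are illustrative, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)  # hypothetical simulated data
n, p = 50, 5
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0, 0.0, 1.5, 0.0])
y = X @ beta_true + rng.normal(size=n)

# Center y and the columns of X so the (unpenalized) intercept drops out.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def ridge(Xc, yc, lam):
    """Closed-form ridge solution: (X'X + lam*I)^{-1} X'y."""
    p = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

# Larger lambda shrinks the whole coefficient vector toward zero.
for lam in (0.0, 10.0, 1000.0):
    print(lam, np.round(np.linalg.norm(ridge(Xc, yc, lam)), 3))
```

At \(\lambda=0\) this reproduces the least squares fit; as \(\lambda\) grows, the norm of the coefficient vector decreases monotonically.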
Each curve corresponds to the ridge regression coefficient estimate for one of the ten variables, plotted as a function of \(\lambda\).
This displays the same ridge coefficient estimates as the previous graphs, but instead of displaying \(\lambda\) on the x-axis, we now display \(||\hat{\beta}_\lambda^R||_2/||\hat{\beta}||_2\), where \(\hat{\beta}\) denotes the vector of the least squares coefficient estimates.
The notation \(||\beta||_2\) denotes the \(\ell_2\) norm of a vector, defined as \(||\beta||_2 = \sqrt{\sum_{j=1}^p\beta_j^2}\); it measures the distance of \(\beta\) from zero.
The standard least squares coefficient estimates are scale equivariant: multiplying \(X_j\) by a constant c simply leads to a scaling of the least squares coefficient estimates by a factor of \(1/c\). In other words, regardless of how the jth predictor is scaled, \(X_j\hat{\beta}_j\) will remain the same.
In contrast, the ridge regression coefficient estimates can change substantially when multiplying a given predictor by a constant, due to the sum of squared coefficients term in the penalty part of the ridge regression objective function.
Therefore, it is best to apply ridge regression after standardizing the predictors, using the formula
\[\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^n(x_{ij}-\bar{x}_j)^2}}\]
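Both points can be checked numerically: least squares rescales exactly by \(1/c\), ridge does not, and standardizing with the formula above removes the scale dependence. A sketch with simulated data (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

X2 = X.copy()
X2[:, 0] *= 100.0  # multiply the first predictor by c = 100

# Least squares: the first coefficient is scaled by exactly 1/c,
# so the ratio below is c = 100 (up to floating-point rounding) ...
print(ols(X, y)[0] / ols(X2, y)[0])
# ... but the ridge estimate changes by a different factor.
print(ridge(X, y, 5.0)[0] / ridge(X2, y, 5.0)[0])

# Standardizing with the slide's formula (np.std uses the 1/n convention)
# makes the ridge fit identical for both scalings:
print(np.allclose(ridge(X / X.std(axis=0), y, 5.0),
                  ridge(X2 / X2.std(axis=0), y, 5.0)))
```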
How do you think ridge regression fits into the bias-variance tradeoff?
Simulated data with n = 50 observations, p = 45 predictors, all having nonzero coefficients. Squared bias (black), variance (green), and test mean squared error (purple) for the ridge regression predictions on a simulated data set, as a function of \(\lambda\) and \(||\hat{\beta}_\lambda^R||_2/||\hat{\beta}||_2\). The horizontal dashed lines indicate the minimum possible MSE. The purple crosses indicate the ridge regression models for which the MSE is smallest.
Ridge regression does have one obvious disadvantage: unlike subset selection, which will generally select models that involve just a subset of the variables, ridge regression will include all p predictors in the final model. The penalty shrinks all of the coefficients toward zero, but it will not set any of them exactly to zero.
The Lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage. The lasso coefficients, \(\hat{\beta}_\lambda^L\), minimize the quantity
\[\sum_{i=1}^n(y_i-\beta_0-\sum_{j=1}^p\beta_jx_{ij})^2+\lambda\sum_{j=1}^p|\beta_j|\]
\[ = RSS + \lambda\sum_{j=1}^p|\beta_j|\]
where \(\lambda\geq 0\) is a tuning parameter, to be determined separately.
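Unlike the ridge penalty, the \(\ell_1\) penalty can set coefficients exactly to zero, so the lasso performs variable selection. A minimal coordinate-descent sketch for the objective above (numpy only; the soft-thresholding update and the simulated data are illustrative):

```python
import numpy as np

def soft_threshold(z, g):
    """Soft-thresholding operator: sign(z) * max(|z| - g, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for RSS + lam * sum_j |beta_j| (X, y centered)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: leave out predictor j's current contribution.
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j, lam / 2.0) / col_sq[j]
    return beta

rng = np.random.default_rng(4)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]   # only three predictors matter
y = X @ beta_true + rng.normal(size=n)
y = y - y.mean()
X = X - X.mean(axis=0)

beta_lasso = lasso_cd(X, y, lam=100.0)
beta_ridge = np.linalg.solve(X.T @ X + 100.0 * np.eye(p), X.T @ y)

print("lasso exact zeros:", np.sum(beta_lasso == 0.0))  # several exact zeros
print("ridge exact zeros:", np.sum(beta_ridge == 0.0))  # none: ridge only shrinks
```

With the same \(\lambda\), the lasso zeroes out the irrelevant predictors while ridge merely shrinks their coefficients toward, but not to, zero.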